Tboung Khmum Province
VLSP 2025 MLQA-TSR Challenge: Vietnamese Multimodal Legal Question Answering on Traffic Sign Regulation
Luu, Son T., Vo, Trung, Nguyen, Hiep, Tran, Khanh Quoc, Van Nguyen, Kiet, Tran, Vu, Nguyen, Ngan Luu-Thuy, Nguyen, Le-Minh
This paper presents VLSP 2025 MLQA-TSR, the shared task on multimodal legal question answering on traffic sign regulation at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Asia > Japan (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
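The retrieval subtask of VLSP 2025 MLQA-TSR above is scored with F2, the F-beta measure with beta = 2, which favors recall over precision. A minimal sketch for a single query follows. The article IDs are hypothetical, and the abstract does not say how scores are aggregated across queries, so per-query scoring is an assumption here.

```python
def f_beta(retrieved, relevant, beta=2.0):
    """F-beta score for one query; beta=2 gives the F2 measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0  # no correct article retrieved
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical article IDs: one of three retrieved articles is relevant,
# and one of the two relevant articles is found.
score = f_beta(["art_1", "art_2", "art_3"], ["art_1", "art_4"])
print(f"F2 = {score:.4f}")
```

With precision 1/3 and recall 1/2, the formula gives 5/11, which sits closer to recall than the balanced F1 would.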
GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning
Tung, Luu Quy, Viet, Hoang Quoc, Loc, Pham Bao, Thu, Vo Trong
Chain-of-Thought (CoT) is a robust approach for tackling LLM tasks that require intermediate reasoning steps prior to generating a final answer. In this paper, we present GreenMind-Medium-14B-R1, a Vietnamese reasoning model inspired by the finetuning strategy based on Group Relative Policy Optimization. We also leverage a high-quality Vietnamese synthesized reasoning dataset and design two reward functions to tackle the main limitations of this technique: (i) language mixing, where we explicitly detect the presence of biased language characters during the process of sampling tokens, and (ii) factual drift, where we leverage Sentence Transformer-based models to ensure that the generated reasoning content maintains factual correctness and does not distort the final output. Experimental results on the Vietnamese dataset from the VLSP 2023 Challenge demonstrate that our model outperforms prior works and enhances linguistic consistency in its responses. Furthermore, we extend our evaluation to SeaExam, a multilingual multiple-choice dataset, showing the effectiveness of our reasoning method compared to few-shot prompting techniques.
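The two ideas the GreenMind abstract combines, a language-consistency reward and group-relative scoring of sampled responses, can be illustrated with a small sketch. This is not the authors' implementation: the CJK character range used to flag mixed-language output, the binary reward values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def language_consistency_reward(text):
    """Hypothetical reward: 0.0 if any CJK ideograph appears, else 1.0."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    return 0.0 if has_cjk else 1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of sampled responses to the same prompt (toy examples).
group = ["Xin chào", "Xin 你好", "Cảm ơn bạn", "谢谢"]
rewards = [language_consistency_reward(t) for t in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses that stay in Vietnamese receive a positive advantage relative to their group, while the mixed-language samples receive a negative one.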
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
Ding, Xuanwen, Pan, Chengjun, Li, Zejun, Zhang, Jiwen, Wang, Siyuan, Wei, Zhongyu
Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To address this difficulty, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses: on MMT-Bench, AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy relative to the full benchmark evaluation.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Mississippi (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Information Technology (0.46)
- Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
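The AutoJudger abstract above relies on Item Response Theory to estimate question difficulty. As a rough illustration of why difficulty estimates help pick informative questions, the sketch below uses the one-parameter (Rasch) model and the classic maximum-information rule from adaptive testing. This is a stand-in, not AutoJudger's agent-based selection; the difficulty values and the function names are hypothetical.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) probability that a model with this ability answers correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fisher_information(ability, difficulty):
    """Information a question carries about ability under the 1PL model."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)


def most_informative(ability, difficulties, asked):
    """Index of the unasked question with the highest information."""
    candidates = [i for i in range(len(difficulties)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(ability, difficulties[i]))


# Hypothetical difficulties for three questions and a current ability estimate of 0.
difficulties = [-2.0, 0.1, 3.0]
first = most_informative(0.0, difficulties, asked=set())
second = most_informative(0.0, difficulties, asked={first})
print(first, second)
```

Under this model, information peaks when a question's difficulty matches the current ability estimate, so questions that are far too easy or far too hard are deprioritized.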
MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
Yan, Dawei, Li, Yang, Chen, Qing-Guo, Luo, Weihua, Wang, Peng, Zhang, Haokui, Shen, Chunhua
Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. Additionally, as training data, it provides richer contextual reasoning information, thereby guiding the model to achieve better performance. However, existing vision-language models (VLMs) primarily rely on single-turn dialogue training and evaluation benchmarks. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k -- the largest multi-image multi-turn instruction tuning dataset with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench -- a diagnostic benchmark featuring dialogues, spanning 8 domains (Humanities, Natural, Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1% on AI2D, +1.2% on MMMU and MMVet). MMCR and prompt engineering will be released publicly.
- North America > United States > Mississippi (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Gu, Shuhao, Zhang, Jialing, Zhou, Siyuan, Yu, Kevin, Xing, Zhaohu, Wang, Liangdong, Cao, Zhou, Jia, Jintao, Zhang, Zhuoyi, Wang, Yixuan, Hu, Zhenchong, Zhang, Bo-Wen, Li, Jijie, Liang, Dong, Zhao, Yingli, Wang, Songjing, Ao, Yulong, Ju, Yiming, Ma, Huanhuan, Li, Xiaotong, Diao, Haiwen, Cui, Yufeng, Wang, Xinlong, Liu, Yaoqi, Feng, Fangxiang, Liu, Guang
Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.
- Europe > Austria > Vienna (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model
Moen, Hans, Raj, Vishnu, Vabalas, Andrius, Perola, Markus, Kaski, Samuel, Ganna, Andrea, Marttinen, Pekka
Health registers contain rich information about individuals' health histories. Here our interest lies in understanding how individuals' health trajectories evolve in a nationwide longitudinal dataset with coded features, such as clinical codes, procedures, and drug purchases. We introduce a straightforward approach for training a Transformer-based deep learning model in a way that lets us analyze how individuals' trajectories change over time. This is achieved by modifying the training objective and by applying a causal attention mask. We focus here on a general task of predicting the onset of a range of common diseases in a given future forecast interval. However, instead of providing a single prediction about diagnoses that could occur in this forecast interval, our approach enables the model to provide continuous predictions at every time point up until, and conditioned on, the time of the forecast period. We find that this model performs comparably to other models, including a bi-directional transformer model, in terms of basic prediction performance while at the same time offering promising trajectory modeling properties. We explore a couple of ways to use this model for analyzing health trajectories and aiding in early detection of events that forecast possible later disease onsets. We hypothesize that this method may be helpful in continuous monitoring of people's health trajectories and enabling interventions in ongoing health trajectories, as well as being useful in retrospective analyses.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- Asia > Cambodia > Tboung Khmum Province > Suong (0.04)
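The health-trajectory abstract above attributes its per-time-point predictions partly to a causal attention mask. The sketch below shows the general mechanism in plain Python: position i may attend only to positions up to and including i, so a prediction at any time point cannot use later events. It illustrates the standard technique, not the authors' model; the uniform attention scores are a toy input.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax over allowed positions; masked positions get weight 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the row maximum over allowed entries for numerical stability.
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy sequence of three events with identical raw attention scores.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```

Each row spreads its attention only over the current and earlier events, which is what lets a single forward pass yield one prediction per time point.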
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Yue, Xiang, Song, Yueqi, Asai, Akari, Kim, Seungone, Nyandwi, Jean de Dieu, Khanuja, Simran, Kantharuban, Anjali, Sutawika, Lintang, Ramamoorthy, Sathyanarayanan, Neubig, Graham
Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and Western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal how English data proportions, language popularity, and the number of multimodal training samples affect overall performance. We fully open-source our data, code, and trained checkpoints to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
- South America > Brazil (0.28)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
Tackling prediction tasks in relational databases with LLMs
Wydmuch, Marek, Borchmann, Łukasz, Graliński, Filip
Though large language models (LLMs) have demonstrated exceptional performance across numerous problems, their application to predictive tasks in relational databases remains largely unexplored. In this work, we address the notion that LLMs cannot yield satisfactory results on relational databases due to their interconnected tables, complex relationships, and heterogeneous data types. Using the recently introduced RelBench benchmark, we demonstrate that even a straightforward application of LLMs achieves competitive performance on these tasks. These findings establish LLMs as a promising new baseline for ML on relational databases and encourage further research in this direction.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
- Europe > Czechia > Prague (0.04)
Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
Doan, Khang T., Huynh, Bao G., Hoang, Dung T., Pham, Thuc D., Pham, Nhat H., Nguyen, Quan T. M., Vo, Bang Q., Hoang, Suong N.
In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question-answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks like OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit into various on-device applications easily. Additionally, we have open-sourced several Vietnamese vision question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
- Asia > Vietnam > Bạc Liêu Province > Bạc Liêu (0.14)
- Asia > Vietnam > Khánh Hòa Province (0.05)
- Asia > Vietnam > Quảng Ninh Province (0.04)
TabularFM: An Open Framework For Tabular Foundational Models
Tran, Quan M., Hoang, Suong N., Nguyen, Lam M., Phan, Dzung, Lam, Hoang Thanh
Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has primarily focused on unstructured data, such as text and images, or semi-structured data, like time-series. However, there has been limited attention to structured data, such as tabular data, which, despite its prevalence, remains under-studied due to a lack of clean datasets and insufficient research on the transferability of FMs for various tabular data tasks. In response to this gap, we introduce a framework called TabularFM, which incorporates state-of-the-art methods for developing FMs specifically for tabular data. This includes variations of neural architectures such as GANs, VAEs, and Transformers. We have curated a million tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models along with leaderboards for future comparative studies. Our fully open-sourced system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to enhance the validity and usability of tabular FMs in the near future.
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York (0.04)